Recurrent neural networks

Per Unneberg

19 January, 2022

Overview

Recap

Sequential models

Recurrent neural networks (RNNs)

LSTMs and GRUs

Practical applications

Recap

Perceptron (single neuron)

Architecture

A single neuron has \(n\) inputs \(x_i\) and an output \(y\). To each input is associated a weight \(w_i\).

Activity rule

The activity rule is given by two steps:

\[a = \sum_{i} w_ix_i, \quad i=0,...,n\]

\[\begin{array}{ccc} \mathrm{activation} & & \mathrm{activity}\\ a & \rightarrow & y(a) \end{array}\]

(MacKay, 2003)

Perceptron (single neuron)

\[a = w_0 + \sum_{i} w_ix_i, \quad i=1,...,n\]

\[y = y(a) = g\left( w_0 + \sum_{i=1}^{n} w_ix_i \right)\]

or in vector notation

\[y = g\left(w_0 + \mathbf{X^T} \mathbf{W} \right)\]

where:

\[\quad\mathbf{X}= \begin{bmatrix}x_1\\ \vdots \\ x_n\end{bmatrix}, \quad \mathbf{W}=\begin{bmatrix}w_1\\ \vdots \\ w_n\end{bmatrix}\]

(Alexander Amini, 2021)
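As a sketch of the activity rule above, the single neuron can be written in a few lines of numpy. The sigmoid is used here as one common choice of activation \(g\); the inputs and weights are made-up illustration values:

```python
import numpy as np

def neuron(x, w, w0):
    """Single-neuron activity rule: activation a, then activity y = g(a)."""
    a = w0 + x @ w                    # a = w_0 + sum_i w_i x_i
    return 1.0 / (1.0 + np.exp(-a))   # g: sigmoid activation

# illustration: two inputs with hand-picked weights
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
y = neuron(x, w, w0=0.0)   # a = 0.5 - 0.5 = 0, so y = g(0) = 0.5
```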

Simplified illustration and notation

Architecture

Vectorized versions: input \(\boldsymbol{x}\), weights \(\boldsymbol{w}\), output \(\boldsymbol{y}\)

Activity rule

\[a = \boldsymbol{wx}\]

Feed forward network

Simplified illustration

Simplified illustration

Sequential models

Motivation

Sequences around us

Word prediction

Language translation

Time series

(Herzen et al., 2021)

Genomics

(Shen et al., 2018)

Types of models

one to one

many to one

one to many

many to many

Image classification

Sentiment analysis

Image captioning

Machine translation

(Karpathy, 2015)

Recurrent Neural Networks (RNNs)


Feed forward network implementation to sequential data

Assume multiple time points.

  • Time points are modelled individually ( \(\hat{Y}_t = f(X_t)\) )
  • Dependency between inputs is not modelled, so ambiguous sequences cannot be distinguished:

“dog bites man” vs “man bites dog”

  • We also want dependency on previous inputs ( \(\hat{Y}_t = f(..., X_2, X_1)\) )

Adding recurrence relations

Folded representation

Unfolded representation

Add a hidden state \(h\) that introduces a dependency on the previous step:

\[ \hat{Y}_t = f(X_t, h_{t-1}) \]

Sequential memory of RNNs

RNNs have what one could call “sequential memory” (Phi, 2020)

Alphabet

Exercise: say alphabet in your head

    A B C ... X Y Z

Modification: start from e.g. letter F

May take time to get started, but from there on it’s easy

Now read the alphabet in reverse:

    Z Y X ... C B A

Memory access is associative and context-dependent

Recurrent Neural Networks


Add recurrence relation where current hidden cell state \(h_t\) depends on input \(x_t\) and previous hidden state \(h_{t-1}\) via a function \(f_W\) that defines the network parameters (weights):

\[ h_t = f_\mathbf{W}(x_t, h_{t-1}) \]

Note that the same function and weights are used across all time steps!

Recurrent Neural Networks - pseudocode


import numpy as np

class RNN:
  # ... (weight matrices W_hh, W_xh, W_hy and hidden state h
  #      are initialized elsewhere)
  # Description of forward pass
  def step(self, x):
    # update the hidden state from the previous state and the current input
    self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
    # compute the output vector
    y = np.dot(self.W_hy, self.h)
    return y

rnn = RNN()
ff = FeedForwardNN()  # some feed-forward network producing the final prediction

for word in input_sequence:
    output = rnn.step(word)

prediction = ff(output)

Vanilla RNNs


Output vector

\[ \hat{Y}_t = \mathbf{W_{hy}^T}h_t \]

Update hidden state

\[ h_t = \mathsf{tanh}(\mathbf{W_{xh}^T}X_t + \mathbf{W_{hh}^T}h_{t-1}) \]

Input vector

\[ X_t \]

Vanilla RNNs


(Olah, 2015)

Note: \(\mathbf{W_{xh}}\), \(\mathbf{W_{hh}}\), and \(\mathbf{W_{hy}}\) are shared across all cells!
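The weight sharing can be made concrete in a short numpy sketch: one set of matrices is reused at every time step. Shapes and values here are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 4, 3, 4, 5

# one set of weights, shared across all time steps
W_xh = rng.normal(size=(n_hidden, n_in))
W_hh = rng.normal(size=(n_hidden, n_hidden))
W_hy = rng.normal(size=(n_out, n_hidden))

X = rng.normal(size=(T, n_in))   # a toy input sequence
h = np.zeros(n_hidden)           # initial hidden state
outputs = []
for t in range(T):
    h = np.tanh(W_xh @ X[t] + W_hh @ h)   # h_t = tanh(W_xh x_t + W_hh h_{t-1})
    outputs.append(W_hy @ h)              # y_t = W_hy h_t
```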

Desired features of RNN

  1. Variable sequence lengths

Not all inputs are of equal length

  2. Long-term memory

“I grew up in England, and … I speak fluent English”

  3. Preserve order

“dog bites man” != “man bites dog”

  4. Share parameters

Addresses points 2 and 3.

Example: Box & Jenkins airline passenger data set

(Onnen, 2021)

Example: generate test and training data

Partition the time series into training and test data sets at e.g. a 2:1 ratio:

import rnnutils
import numpy as np
df = rnnutils.airlines()
data = np.array(df['passengers'].values.astype('float32')).reshape(-1, 1)
train, test, scaler = rnnutils.make_train_test(data) 
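rnnutils is course material; as a rough, hypothetical stand-in for what its split-and-scale step might do (the real helper also returns a scaler object), one could sketch:

```python
import numpy as np

def make_train_test(data, ratio=2/3):
    """Split a time series into train/test and min-max scale with train stats.

    Hypothetical stand-in for rnnutils.make_train_test, for illustration only.
    """
    n_train = int(len(data) * ratio)
    train, test = data[:n_train], data[n_train:]
    lo, hi = train.min(), train.max()
    scale = lambda x: (x - lo) / (hi - lo)   # fit on train, apply to both
    return scale(train), scale(test), (lo, hi)

data = np.arange(12, dtype="float32").reshape(-1, 1)
train, test, scaler = make_train_test(data)
```

Note that the scaling statistics come from the training part only, so no information from the test period leaks into training.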

Example: prepare data for keras

time_steps = 12
trainX, trainY, trainX_indices, trainY_indices = rnnutils.make_xy(train, time_steps)
testX, testY, testX_indices, testY_indices = rnnutils.make_xy(test, time_steps) 
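The make_xy helper builds supervised pairs from the series with a sliding window; a hypothetical minimal version (without the index bookkeeping of the real helper) could look like:

```python
import numpy as np

def make_xy(series, time_steps):
    """Slide a window of length time_steps over the series; each window
    predicts the next value (hypothetical sketch of rnnutils.make_xy)."""
    X, Y = [], []
    for i in range(len(series) - time_steps):
        X.append(series[i:i + time_steps])
        Y.append(series[i + time_steps])
    # stacking windows of a (n, 1) series yields [batch, timesteps, feature]
    return np.array(X), np.array(Y)

series = np.arange(20, dtype="float32").reshape(-1, 1)
X, Y = make_xy(series, time_steps=12)
```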

Example: create vanilla RNN model

from keras.models import Sequential
from keras.layers import Dense, SimpleRNN

model = Sequential()
model.add(SimpleRNN(units=3, input_shape=(time_steps, 1),
                    activation="tanh"))
model.add(Dense(units=1, activation="tanh"))
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary() 
## Model: "sequential"
## _________________________________________________________________
## Layer (type)                 Output Shape              Param #   
## =================================================================
## simple_rnn (SimpleRNN)       (None, 3)                 15        
## _________________________________________________________________
## dense (Dense)                (None, 1)                 4         
## =================================================================
## Total params: 19
## Trainable params: 19
## Non-trainable params: 0
## _________________________________________________________________
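The parameter counts in the summary can be checked by hand: a SimpleRNN layer with \(u\) units and \(f\) input features has \(u(u + f + 1)\) parameters (recurrent weights, input weights, and biases), and the Dense layer has \(u + 1\):

```python
units, features = 3, 1
simple_rnn_params = units * (units + features + 1)  # W_hh (3x3) + W_xh (3x1) + bias (3)
dense_params = units + 1                            # 3 weights + 1 bias
total = simple_rnn_params + dense_params            # matches model.summary()
```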

Example: fit the model and evaluate

history = model.fit(trainX, trainY, epochs=20, batch_size=1, verbose=2)
Ytrainpred = model.predict(trainX)
Ytestpred = model.predict(testX) 
rnnutils.plot_history(history) 
data = {'train': (Ytrainpred, train, trainY_indices),
        'test': (Ytestpred, test, testY_indices)}
rnnutils.plot_pred(data, scaler=scaler, ticks=range(0, 144, 20), labels=df.year[range(0, 144, 20)]) 

Example: model topology written out

## Model: "sequential"
## _________________________________________________________________
## Layer (type)                 Output Shape              Param #   
## =================================================================
## simple_rnn (SimpleRNN)       (None, 3)                 15        
## _________________________________________________________________
## dense (Dense)                (None, 1)                 4         
## =================================================================
## Total params: 19
## Trainable params: 19
## Non-trainable params: 0
## _________________________________________________________________

(Verma, 2021)

NB! In keras, RNN input is a 3D tensor with shape [batch, timesteps, feature]
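For example, a univariate series cut into windows of 12 steps must carry a trailing feature axis before keras will accept it:

```python
import numpy as np

# 8 windows of 12 time steps from a univariate series
windows = np.arange(96, dtype="float32").reshape(8, 12)

# add the trailing feature axis: [batch, timesteps, feature]
X = windows[..., np.newaxis]
```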

An RNN in numbers

(Karpathy, 2015)

Example network trained on “hello”, showing activations in the forward pass given input “hell”. The output layer contains confidence scores over the vocabulary ({h, e, l, o}). We want blue numbers high and red numbers low. P(e) is the prediction in the context of “h”, P(l) in the context of “he”, and so on.

What is the topology of the network?

4 input units (features), 4 time steps, 3 hidden units, 4 output units

Exercise

See if you can improve the airline passenger model. Some things to try:

  • change the number of units
  • change time_steps
  • change the number of epochs

Training

Recap: backpropagation algorithm in ffns

(Alexander Amini, 2021)


  1. perform forward pass and generate prediction
  2. calculate prediction error \(\epsilon_i\) wrt (known) output: \(\epsilon_i = \mathcal{L}(\hat{y}_i, y_i)\), loss function \(\mathcal{L}\)
  3. back propagate errors and update weights to minimize loss

Backpropagation through time (BPTT)

(Alexander Amini, 2021)

Errors are propagated backwards in time from \(t=t\) to \(t=0\).

Problem: calculating the gradient may depend on large powers of \(\mathbf{W_{hh}}^{\mathsf{T}}\) (e.g. \(\delta\mathcal{L} / \delta h_0 \sim f((\mathbf{W_{hh}}^{\mathsf{T}})^t)\))

The effect of vanishing gradients on long-term memory

In layer \(i\), gradient size \(\sim (\mathbf{W_{hh}}^{\mathsf{T}})^{t-i}\)

\(\downarrow\)

Weight adjustments depend on size of gradient

\(\downarrow\)

Early layers tend to “see” small gradients and do very little updating

\(\downarrow\)

Parameters become biased towards learning recent events

\(\downarrow\)

RNNs suffer from short-term memory

(Olah, 2015)

“The clouds are in the _”


“I grew up in England … I speak fluent _”
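The vanishing effect is easy to demonstrate numerically: repeatedly multiplying a gradient signal by a matrix whose largest singular value is below 1 drives it towards zero. The matrix here is an arbitrary illustration:

```python
import numpy as np

W = 0.5 * np.eye(3)     # toy recurrent weight matrix, spectral radius 0.5
grad = np.ones(3)       # stand-in for an error signal at time t
for _ in range(20):     # propagate back 20 steps: grad ~ W^20 @ ones
    grad = W @ grad

# after 20 steps the signal has shrunk by a factor 0.5**20 ≈ 1e-6
```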

Solutions to vanishing gradient

  1. Activation function

    ReLU (or leaky ReLU) instead of sigmoid or tanh

  2. Weight initialization

    Set biases to 0, weights to the identity matrix

  3. More complex cells using “gating”

    For example LSTM

LSTMs and GRUs

Motivation behind LSTMs and GRUs

LSTM

GRU

Long Short Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014) architectures were proposed to solve the vanishing gradient problem.

Intuition

In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al., 2014)



Remember the important parts, pay less attention to (forget) the rest.

LSTM: Cell state flow and gating

(Olah, 2015)

LSTM adds a cell state that in effect provides the long-term memory

Information flows in the cell state from \(c_{t-1}\) to \(c_t\).

Gates affect the amount of information let through. The sigmoid layer outputs anything from 0 (nothing) to 1 (everything).

(Cho et al., 2014)

In our preliminary experiments, we found that it is crucial to use this new unit with gating units. We were not able to get meaningful result with an oft-used tanh unit without any gating.

Forget, input, and output gates

forget gate

Purpose: reset content of cell state

input gate

Purpose: decide when to read data into cell state

output gate

Purpose: read entries from cell state

The forget gate

Purpose: decide what information to keep or throw away

Sigmoid squishes vector \([\boldsymbol{h_{t-1}}, \boldsymbol{x_t}]\) (previous hidden state + input) to \((0, 1)\) for each value in cell state \(c_{t-1}\), where 0 means “forget entry”, 1 “keep it”

\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \]

Add new information - the input gate

Two steps to adding new information:

  1. sigmoid layer decides which values to update
  2. tanh layer creates vector of new candidate values \(\tilde{c}_t\)

\[ i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\ \tilde{c}_t = \mathsf{tanh}(W_c \cdot [h_{t-1}, x_t] + b_c) \]

Updating the cell state

  1. multiply old cell state by \(f_t\) to forget what was decided to forget
  2. add new candidate values scaled by how much we want to update them \(i_t * \tilde{c}_t\)

\[ c_t = f_t * c_{t-1} + i_t * \tilde{c}_t \]

Cell output

Output is filtered version of cell state.

  1. sigmoid output gate decides what parts of cell state to output
  2. push cell state through tanh and multiply by sigmoid output

\[ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \mathsf{tanh}(c_t) \]

LSTM: putting it together

Intuition

  • if forget ~ 1 and input ~ 0, \(c_{t-1}\) is carried to the next time step (the input is irrelevant for the cell state)
  • if forget ~ 0 and input ~ 1, pay attention to the current input

LSTM: putting it together

(Zhang et al., 2021)

\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\\ i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\ \tilde{c}_t = \mathsf{tanh}(W_c \cdot [h_{t-1}, x_t] + b_c)\\ c_t = f_t * c_{t-1} + i_t * \tilde{c}_t\\ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \mathsf{tanh}(c_t) \]

\[ x_t \in \mathbb{R}^{n\times d}, \quad h_{t-1} \in \mathbb{R}^{n \times h}, \quad i_t \in \mathbb{R}^{n\times h}, \quad f_t \in \mathbb{R}^{n\times h}, \quad o_t \in \mathbb{R}^{n\times h} \]

and

\[ W_f \in \mathbb{R}^{(h+d) \times h}, \quad W_i \in \mathbb{R}^{(h+d) \times h}, \quad W_o \in \mathbb{R}^{(h+d) \times h}, \quad W_c \in \mathbb{R}^{(h+d) \times h} \]
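The six equations transcribe almost directly into numpy. A sketch of a single LSTM step on one example: here the weights are shaped \(h \times (h+d)\) so they act on a single concatenated column vector rather than a batch, and the random values are purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step following the equations above, with [h, x] concatenated."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)          # forget gate
    i = sigmoid(W_i @ z + b_i)          # input gate
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate values
    c = f * c_prev + i * c_tilde        # new cell state
    o = sigmoid(W_o @ z + b_o)          # output gate
    h = o * np.tanh(c)                  # new hidden state
    return h, c

rng = np.random.default_rng(0)
d, hdim = 2, 3
Ws = [rng.normal(size=(hdim, hdim + d)) for _ in range(4)]   # W_f, W_i, W_c, W_o
bs = [np.zeros(hdim) for _ in range(4)]                      # b_f, b_i, b_c, b_o
h, c = lstm_step(rng.normal(size=d), np.zeros(hdim), np.zeros(hdim), *Ws, *bs)
```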

GRU

  • forget and input gates are combined into a single update gate
  • cell state and hidden state are merged
  • simpler model than the LSTM
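Under the same single-example conventions as the LSTM equations, a GRU step needs only two gates and no separate cell state. A sketch with illustrative shapes, using the gating convention of Cho et al. (2014):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step: update gate z, reset gate r, merged hidden/cell state h."""
    v = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ v + b_z)                          # update gate
    r = sigmoid(W_r @ v + b_r)                          # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)
    return z * h_prev + (1 - z) * h_tilde               # interpolate old/new state

rng = np.random.default_rng(0)
d, hdim = 2, 3
Ws = [rng.normal(size=(hdim, hdim + d)) for _ in range(3)]   # W_z, W_r, W_h
bs = [np.zeros(hdim) for _ in range(3)]                      # b_z, b_r, b_h
h = gru_step(rng.normal(size=d), np.zeros(hdim), *Ws, *bs)
```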

Exercise

1. Analyse airline passengers with LSTM

Modify the airline passenger model to use an LSTM and compare the results. Try out different parameters to improve test predictions.

2. Time-series forecasting with LSTM, discrete state space

LSTM with Variable Length Input Sequences to One Character Output

Time-series forecasting with LSTM, discrete state space

Objective

Predict next character in sequence of strings

Comments

  • you could use several LSTM layers, in which case all but the last one should return sequences (set return_sequences=True) (Brownlee, 2017)
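The return_sequences requirement is about shapes: every recurrent layer consumes a sequence, so all but the last must also emit one. A numpy sketch with two stacked vanilla RNN layers makes this explicit (LSTM layers stack the same way; the shapes here are illustrative):

```python
import numpy as np

def rnn_layer(X, W_xh, W_hh, return_sequences):
    """Run a vanilla RNN over a sequence X of shape (timesteps, features)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in X:
        h = np.tanh(W_xh @ x + W_hh @ h)
        states.append(h)
    # a stacked layer needs the full (timesteps, units) sequence as its input
    return np.array(states) if return_sequences else h

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 1))                       # 12 time steps, 1 feature
layer1 = rnn_layer(X, rng.normal(size=(8, 1)), rng.normal(size=(8, 8)),
                   return_sequences=True)          # emits the whole sequence
layer2 = rnn_layer(layer1, rng.normal(size=(4, 8)), rng.normal(size=(4, 4)),
                   return_sequences=False)         # emits only the final state
```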

Bibliography

Alexander Amini. (2021). MIT 6.S191: Recurrent Neural Networks. https://www.youtube.com/watch?v=qjrad0V0uJE
Brownlee, J. (2017). Stacked Long Short-Term Memory Networks. In Machine Learning Mastery. https://machinelearningmastery.com/stacked-long-short-term-memory-networks/
Cho, K., Merrienboer, B. van, Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [Cs, Stat]. http://arxiv.org/abs/1406.1078
Herzen, J., Lässig, F., Piazzetta, S. G., Neuer, T., Tafti, L., Raille, G., Pottelbergh, T. V., Pasieka, M., Skrodzki, A., Huguenin, N., Dumonal, M., Kościsz, J., Bader, D., Gusset, F., Benheddi, M., Williamson, C., Kosinski, M., Petrik, M., & Grosch, G. (2021). Darts: User-friendly modern machine learning for time series. https://arxiv.org/abs/2110.03224
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Karpathy, A. (2015). The Unreasonable Effectiveness of Recurrent Neural Networks. https://karpathy.github.io/2015/05/21/rnn-effectiveness/
MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms (Illustrated edition). Cambridge University Press.
Olah, C. (2015). Understanding LSTM Networks – colah’s blog. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Onnen, H. (2021). Temporal Loops: Intro to Recurrent Neural Networks for Time Series Forecasting in Python. In Medium. https://towardsdatascience.com/temporal-loops-intro-to-recurrent-neural-networks-for-time-series-forecasting-in-python-b0398963dc1f
Phi, M. (2020). Illustrated Guide to Recurrent Neural Networks. In Medium. https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9
Shen, Z., Bao, W., & Huang, D.-S. (2018). Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Scientific Reports, 8(1), 15270. https://doi.org/10.1038/s41598-018-33321-1
Verma, S. (2021). Understanding Input and Output shapes in LSTM Keras. In Medium. https://shiva-verma.medium.com/understanding-input-and-output-shape-in-lstm-keras-c501ee95c65e
Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive into deep learning. arXiv Preprint arXiv:2106.11342.